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Abstract 


In  2-D  signal  processing,  FIR  filters  are  the  standard  choice  for  implementing  linear  shift-invariant 
(LSI)  systems,  as  they  retain  all  of  the  advantages  of  their  1-D  counterparts.  The  same  is  not  true  in 
general  for  HR  filters.  In  particular,  with  the  exception  of  the  class  of  causal  recursively  computable 
filters,  implementation  issues  for  2-D  HR  filters  have  received  little  attention.  Moreover,  the  forced 
causality  and  apparent  complexity  in  implementing  even  recursively  computable  filters  has  severely 
limited  their  use  in  practice  as  well.  In  this  paper,  we  propose  a  framework  for  implementing  2-D  LSI 
systems  with  2-D  noncausal  HR  filters,  i.e.,  filter  systems  described  implicitly  by  a  difference  equation 
and  boundary  conditions.  This  framework  avoids  many  of  the  drawbacks  commonly  associated  with 
2-D  HR  filtering.  A  number  of  common  2-D  LSI  filter  operations,  (such  as  low-pass,  high-pass, 
and  fan  filters),  are  efficiently  realized  and  implemented  in  this  paper  as  noncausal  HR  filters.  The 
basic  concepts  involved  in  our  approach  include  the  adaptation  of  so-called  direct  methods  for  solving 
partial  differential  equations  (PDF’s),  and  the  introduction  of  an  approximation  methodology  that  is 
particularly  well-suited  to  signal  processing  applications  and  leads  to  very  efficient  implementations. 
In  particular,  for  an  input  and  output  with  N  x  N  samples,  the  algorithm  requires  only  0{N‘^) 
storage  and  computations  (yielding  a  per  pixel  computational  load  that  is  independent  of  image 
size),  and  has  a  parallel  implementation  (yielding  a  per  pixel  computational  load  that  decreases 
with  increasing  image  size).  In  addition  to  its  uses  in  2-D  filtering,  we  believe  that  this  approach 
also  has  applications  in  related  areas,  such  as  geophysical  signal  processing  and  linear  estimation, 
or  any  field  requiring  approximate  solutions  to  elliptic  PDF’s. 

1  Introduction 


For  two-dimensional  signal  processing  applications,  finite  impulse  response  (FIR)  filters  have  been  overwhelm¬ 
ingly  preferred  to  infinite  impulse  response  (HR)  filters  [3,10,12].  This  preference  is  markedly  different  from 
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that  encountered  in  1-D  signal  processing,  in  which  1-D  HR  filtering  plays  a  significant  role.  In  1-D,  an 
extensive  array  of  techniques  exists  for  optimally  designing  both  FIR  and  HR  filters.  Both  FIR  and  HR 
filters  have  efficient,  stable  implementations.  FIR  filters  are  always  stable  and  can  be  implemented  with  the 
FFT.  1-D  HR  filters  can  be.  implemented  in  a  recursive  manner,  and  often  require  less  memory,  lower  order, 
and  fewer  computations  per  sample  than  their  FIR  counterparts. 

In  2-D,  however,  the  design,  analysis,  and  implementation  of  filters  is  complicated  by  a  number  of 
factors,  and  for  the  most  part  practical  2-D  filter  design  and  implementation  has  focused  on  FIR  filters. 
Among  the  reasons  for  this  preference  are;  (a)  efficient  implementation  of  FIR  filters  can  be  accomplished 
equally  well  in  1-D  or  2-D  through  the  use  of  the  FFT;  and  (b)  FIR  filters  do  not  require  any  notion  of 
recursion  or  ordering  of  the  sample  points  (in  1-D  or  in  2-D)  in  order  to  be  implemented.  In  contrast,  for  HR 
filters  the  1-D  and  2-D  cases  appear  to  be  dramatically  different,  and,  in  particular,  both  points  (a)  and  (b) 
become  serious  obstacles  that  have  limited  the  investigation  of  2-D  HR  filters  and  led  many  to  argue  that 
they  cannot  be  implemented  in  practice  (see  [3,10,12]  for  more  on  these  and  related  issues). 

To  understand  these  issues,  as  well  as  our  approach  to  dealing  with  them,  consider  an  HR  filter,  in 
1-D  or  2-D,  specified  in  terms  of  a  difference  equation.  In  either  case,  of  course,  the  difference  equation 
by  itself  isn’t  sufficient  to  completely  specify  the  filtering  algorithm,  as  one  must  also  specify  a  set  of 
boundary  conditions.  In  1-D,  for  the  most  part  these  are  specified  as  a  set  of  initial  conditions,  leading  to 
causally-recursive  filtering  algorithms  with  computational  load  per  sample  point  proportional  to  the  order 
of  the  difference  equation  or,  equivalently,  the  number  of  initial  conditions  required.  Moreover,  even  for 
noncausal  1-D  filters  —  e.g.,  zero-phase  HR  filters  —  implementation,  accomplished  in  this  case  through 
the  combination  of  a  causal  recursion  (with  associated  initial  conditions)  and  an  anti-causal  recursion  (with 
“final”  conditions),  results  in  a  per-sample  computational  load  proportional  to  filter  order  or,  equivalently, 
to  the  total  number  of  boundary  (initial  and  final)  conditions. 

In  contrast,  in  2-D  the  dimension  of  the  required  boundary  conditions  depends  not  only  on  the  order 
of  the  difference  equation  but  also  on  the  size  of  the  boundary  —  i.e.,  the  dimensions  of  the  2-D  domain 
of  interest  —  implying  an  apparently  significant  increase  in  computational  complexity.  In  addition,  since  in 
most  2-D  applications  there  is  no  natural  ordering  of  the  sample  points  and  no  natural  direction  for  recursion, 
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there  is  no  reason  to  expect  that  the  boundary  conditions  would  separate  into  anything  that  might  resemble 
“initial”  or  “final”  conditions,  but  rather  would  more  naturally  be  distributed  around  the  entire  2-D  domain, 
leading  to  2-D  noncausal  HR  (2DNC-IIR)  filters  that  are  not  recursively  computable. 

On  the  other  hand,  if  effective  methods  of  implementation  for  2DNC-IIR  filters  were  available,  there 
would  be  numerous  possibilities  for  their  application.  For  example,  one  potential  advantage  of  HR  filters 
that  is  retained  in  2-D  is  that  a  given  set  of  frequency  response  characteristics  typically  may  be  met  by  an 
HR  filter  of  considerably  lower  order  than  a  corresponding  FIR  design.  Moreover,  2DNC-IIR  filters  arise 
naturally  in  applications  such  as  the  modeling  of  random  fields  for  image  processing  [2, 16]  and  computer 
vision  [1,9].  With  potential  applications  like  these  as  our  motivation,  in  this  paper  we  present  an  approach 
to  the  efficient  implementation  of  2DNC-IIR  filters  that  overcomes  the  difficulties  we  have  described,  thus 
offering  the  possibility  of  recapturing  in  2-D  the  computational  advantages  and  flexibility  that  HR  filters 
have  in  1-D. 

The  key  to  our  approach  is  the  recognition  of  both  the  similarities  and  differences  between  the  im¬ 
plementation  of  2DNC-IIR  filters  and  the  solution  of  finite  difference  and  finite  element  approximations  of 
PDF’s.  In  particular,  the  equations  resulting  from  such  PDF  methods  are  equivalent  to  the  difference  equa¬ 
tions  for  2DNC-IIR  filters,  and  thus  the  many  methods  that  have  been  developed  for  the  efficient  solution 
of  such  PDF’s  can  be  brought  to  bear  in  developing  implementation  concepts  for  2DNC-IIR  filters.  These 
methods  by  themselves,  while  offering  considerable  savings  in  computational  complexity  and  storage,  may 
not  reduce  these  loads  enough  to  make  2DNC-IIR  filters  attractive.  However,  by  taking  advantage  of  a 
fundamental  difference  in  objective  between  solving  PDF’s  and  performing  2-D  filtering,  we  can  reduce  the 
computational  complexity  even  further,  resulting  in  implementations  with  complexity  per  2-D  data  point 
independent  of  domain  size  —  the  same  attractive  feature  as  in  1-D.  In  particular,  while  the  focus  in  PDF’s 
is  typically  on  obtaining  numerically  very  accurate  solutions  to  specific  2-D  difference  equations,  and  hence 
accurate  solutions  to  fluid  flows  or  similar  physical  models,  in  2-D  the  difference  equation  is  not  the  funda¬ 
mental  object.  That  is,  in  filtering  we  generally  begin  with  filter  or  frequency  response  specifications  and 
design  or  choose  a  2-D  difference  equation  that  meets  these  specifications  to  within  some  given  tolerances. 
Consequently,  approximations  to  the  solution  of  the  difference  equation  are  acceptable  as  long  as  they  lead 
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to  filters  that  also  meet  the  desired  tolerances,  and  in  this  paper  we  describe  a  method  for  approximating 
solutions  to  2-D  difference  equations  that  both  meets  the  desired  filtering  objectives  and  also  results  in  very 
efficient  implementations. 

In  the  next  section,  we  introduce  the  class  of  2DNC-IIR  difference  equations  and  briefly  discuss  the 
issue  of  the  choice  of  boundary  conditions  for  such  an  equation.  In  Section  3,  we  make  explicit  the  connection 
of  the  problem  of  interest  here  and  the  methods  for  solving  sparse  linear  systems  of  equations  such  as  those 
arising  in  finite  difference  methods  for  PDE’s.  In  Section  3,  we  also  introduce  one  of  the  constructs  used  in 
direct  solution  of  such  equations,  namely,  the  organization  of  2-D  data  points  into  1-D  arrays  or  columns  and 
the  ordering  of  these  1-D  data  sets  in  ways  that  lead  to  efficient  solution  procedures  through  the  sequential 
processing  of  these  1-D  sets  of  variables.  The  dimensionality  of  these  1-D  data  sets,  of  course,  is  quite  large, 
corresponding  to  the  linear  dimension  (width  or  length)  of  the  2-D  domain.  However,  if  we  view  each  of  these 
sequential  processing  steps  as  being  itself  a  1-D  processing  procedure  along  the  1-D  data  set,  we  are  led 
to  the  idea  of  approximating  this  step  using  low-order  HR  filtering  methods.  This  idea,  which  is  developed 
in  Section  4,  results  in  very  efficient  2DNC-IIR  filtering  procedures  applicable  to  a  large  class  of  noncausal, 
nonseparable  filtering  applications.  In  particular,  in  Section  5  we  illustrate  the  efficient  implementation 
of  several  zero-phase  2DNC-IIR  filters.  Zero-phase  filters  are  of  considerable  interest  in  practice,  and  the 
apparent  difficulty  in  implementing  HR  filters  with  zero  phase  has  often  been  cited  as  one  of  the  reasons  that 
FIR  filters  are  commonly  used.  As  our  results  here  indicate,  we  now  can  implement  zero-phase  IIR  filters 
efficiently,  removing  a  major  obstacle  to  their  use  in  practice. 

2  Two-Dimensional  IIR  Filters  Given  by  Difference  Equations 

A  rich  class  of  2DNC-IIR  filters  can  be  described  implicitly  through  2-D  linear  constant-coefficient  difference 
equations  (LCCDE’s)  of  the  form 

Li  L2  ^2 

^hhVi'i-hJ  -h]=  m  brmm2x[i-Tni,j  -  m2].  (1) 

li  =  —  Li  l2  =  -L2  mi=-Mi  ni2  =  -M2 
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Such  an  equation  corresponds  to  a  2-D  system  function  of  the  form 


H{zi,Z2)  - 


E 


Ml 

mi  =  — Ml 


EM2 

1112  =  —  M2 


ELi 

li  =  —  Li  Z-ul2~  —  L2 


u  -m2 

^mim2  *2 


(2) 


Equation  (1),  of  course,  provides  only  a  partial  specification  of  a  system,  as  it  must  be  accompanied  by  a 
set  of  boundary  conditions  (EC’s),  and  it  is  here  that  we  see  a  significant  difference  from  the  1-D  case.  In 
particular,  for  a  1-D  difference  equation  with  system  function  H{z),  the  nature  of  the  boundary  conditions 
is  completely  and  very  simply  specified  if  we  require  the  system  to  be  stable  so  that  it  has  a  well-defined 
frequency  response.  In  this  case,  poles  of  H{z)  inside  the  unit  circle  correspond  to  causal  parts  of  the 
system  requiring  initial  conditions,  while  poles  outside  the  unit  circle  correspond  to  anticausal  parts  of 
the  system  requiring  final  conditions.  Thus,  in  general,  the  difference  equation  corresponding  to  H{z)  will 
require  conditions  on  both  boundary  points  of  the  data  interval  of  interest.  In  analogy,  the  specification  of 
a  stable  system  corresponding  to  (1)  requires  the  imposition  of  EC’s  around  the  entire  boundary  of  the  2-D 
domain  of  interest.  Of  course  the  lack  of  a  factorization  theorem  for  polynomials  in  two  variables  makes  the 
specification  of  the  appropriate  EC’s  a  more  challenging  problem  than  in  1-D.  Nevertheless,  in  Sections  4 
and  5,  we  illustrate  a  number  of  important  examples  for  which  appropriate  EC’s  are  easily  specified.  For 
these  examples  and  in  general,  the  EC’s  cannot  be  organized  simply  as  sets  of  “initial”  or  “final”  conditions, 
e.g.,  as  considered  in  [3,12].  Conseciuently,  no  simple  recursive  solution  is  possible,  and  all  of  the  output 
values  y[i,j]  must  be  computed,  in  principle,  simultaneously. 

As  in  any  practical  application  in  1-D  or  2-D,  the  data  domain  of  interest  to  us  is  assumed  to  be 
bounded.  While  our  approach  can  be  applied  to  a  non-square  and  non-rectangular  regions,  for  simplicity  in 
the  presentation  here  we  focus  on  the  case  in  which  the  filter  difference  equation  is  supported  on  the  square 
region  of  N  x  N  samples 

Div  =  {(f,j)  I  l<z<lV,  1<J<A^}.  (3) 


Eoundary  conditions  must  then  be  specified  around  the  boundary  dflN  of  Q,n,  which  in  analogy  with  finite 
difference  methods  can  be  thought  of  as  an  annular  region  of  some  width  around  Q.m,  where,  roughly  speaking, 
the  width  of  the  annular  region  depends  upon  both  the  order  of  the  difference  equation  and  the  nature  of 
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the  boundary  conditions.  For  example,  in  1-D  a  second-order  difference  equation  might  require  two  initial 
conditions,  2/[— 1]  and  2/[0],  (and  hence  a  “boundary”  of  width  two  at  the  left  end  of  the  interval  of  interest),  or 
one  initial  and  one  final  condition,  j/[0]  and  r/[Af-|-l],  respectively,  (corresponding  to  a  boundary  of  width  one 
at  either  end).  For  2DNC-nR  filters,  two  types  of  boundary  conditions  follow  directly  from  finite  difference 
approximations  to  Dirichlet  and  Neumann  conditions,  which  are  commonly  imposed  on  elliptic  PDF’s  to 
guarantee  unique  solutions.  The  discrete  Dirichlet  conditions  (“of  width  one”)  set  the  value  of  y\i,j]  on 
the  boundary  of  Dw  of  width  one  defined  by 

aD]v  =  {(bi)  he  {0,7V-H},  jG[o,iV  +  i]}  u  {(f,j)  h  e  [o,iV  + 1],  je{o,N  +  i}}, 

as  illustrated  in  Figure  1.  The  discrete  Neumann  conditions  correspond  to  a  finite- difference  approximation 
of  the  derivative  of  the  filter  output  normal  to  the  boundary  of  Dw  Thus  Neumann  conditions  locally 
constrain  the  values  of  y[i,j]  as  follows: 

y[i,j]-y[pii,j),Q{iJ)]^r[i,j],  {i,j)Edn]^,  (4) 

where  is  the  closest  point  in  Dat  to  the  boundary  point  As  can  be  seen  from  Figure  1, 

Equation  (4)  provides  a  local  constraint  between  each  of  the  output  values  in  the  outermost  square  (rep¬ 
resented  by  o’s)  and  their  nearest  neighbor  in  (represented  by  the  outermost  square  of  •’s).  Although 
we  will  focus  upon  Dirichlet  and  Neumann  conditions  in  this  paper,  the  number  of  possible  choices  is  much 
greater.  The  suitability  of  particular  choices  for  the  EC’s,  however,  is  a  function  of  the  filter  characteristics; 
(the  same  is  true  in  1-D,  but  the  choices  for  EC’s  are  far  more  limited). 
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Figure  1:  The  elements  of  fljv  (denoted  by  •)  and  dQ\f  (denoted  by  o). 

3  Implementing  Noncausal  HR  Filters  as  Linear  Systems  of  Equa¬ 
tions 

3.1  Direct  vs.  Iterative  Methods 

In  this  section  we  examine  the  problem  of  implementing  2DNC-IIR  filters  and,  in  particular,  make  precise  the 
connection  between  this  problem  and  the  general  problem  of  solving  large,  sparse,  sets  of  linear  equations, 
in  particular  those  arising  in  the  solution  of  linear  PDF’s.  The  methods  that  result  from  this  connection  are 
quite  broadly  applicable.  For  example,  our  methodology  can  be  used  for  linear  difference  equations  which 
are  not  constant-coefficient,  for  regions  of  support  ft  which  are  non-square  and  irregularly  sampled, 
and  for  various  types  of  boundary  conditions.  However,  for  notational  simplicity  in  this  and  the  following 
sections,  unless  stated  otherwise,  we  assume  that  the  difference  equation  is  LCCDE,  that  D  =  ffjv,  and  that 
the  boundary  conditions  are  of  the  Dirichlet  type.  Some  of  these  assumptions  are  relaxed  in  the  examples 
of  Section  5.  Also,  since  our  focus  here  is  on  the  HR  nature  of  filters,  for  clarity  of  exposition  we  make 
one  final  assumption,  namely,  that  bmim2  —  in  (1),  where  is  the  Kronecker  delta  function. 

Since  implementing  the  right-hand  side  of  Equation  (1)  is  equivalent  to  implementing  an  FIR  filter,  a  more 
complicated  right-hand  side  adds  only  notational  but  not  conceptual  complexity.  These  assumptions  lead  to 
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a  2DNC-IIR  filter  system  given  algebraically  by 


EfiL-L,  Etl-L2  2/[«  -  hj  -  y  =  x[ij] ,  {i,j)  e  Qn 


2/[bj]  =  r[i,j] , 


{i,j)  e  dO,\j 


(5) 


where  r\i,j]  is  known. 

Equation  (5)  can  be  cast  in  matrix  form  as 

Ay  =  X  +  r  =  6 ,  (6) 

where  the  non-zero  elements  of  A  are  the  filter  coefficients  aq;,.  Vectors  x  and  y_  contain  the  filter  input 
x[i,j]  and  output  y[i-,j],  respectively,  in  n/v,  and  r  contains  the  contribution  of  the  Dirichlet  conditions 
entering  through  the  filter  difference  equation.  Since  the  focus  here  is  on  LSI  systems,  we  set  r  equal  to 
zero  to  preserve  linearity.  The  order  in  which  the  variables  y[i,  k\  appear  in  y_  is  the  ordering  of  Qat,  or  the 
ordering  of  A.  For  direct  methods  (defined  in  the  following  paragraph),  this  ordering  can  drastically  alter 
the  apparent  complexity  of  the  implementation. 

Note  that  the  matrix  A  has  dimension  N'^  x  .  A  nice  property  of  HR  filters  is  that  they  generally 
require  a  small  number  of  coefficients,  so  that  Li  -C  V  and  L2  <C  N.  In  other  words,  A  will  be  very 
sparse.  This  obviously  suggests  the  use  of  numerical  methods  developed  for  solving  large  sparse  systems, 
like  Equation  (6),  that  take  advantage  of  this  sparsity  to  minimize  computational  and  storage  requirements. 
In  particular,  there  are  two  distinct  classes  of  methods  for  calculating  the  output  y  in  (6),  i.e.,  for  implicitly 
calculating  A~^b.  This  calculation  can  be  performed  by  either  iterative  or  direct  methods.  Iterative  methods 
begin  with  an  estimate  of  y,  and  produce  at  each  step  an  estimate  y_^  which  theoretically  converges  as 
limfc__oo  =  y  \  however,  in  practice  the  series  must  converge  within  a  tolerable  error  in  a  finite  number 
of  steps.  Direct  methods  consist  of  variants  of  the  LU  factorization  [6, 7]  (Gaussian  elimination  followed  by 
back-substitution),  and  produce  the  exact  solution  (disregarding  numerical  errors)  in  a  finite  number  of  steps. 
Assuming  both  methods  produce  a  solution  within  a  desired  level  of  accuracy,  their  relative  performance  is 
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measured  in  terms  of  storage  and  computational  requirements. 

A  comparison  of  direct  and  iterative  methods  is  impossible  without  knowing  the  specific  applications. 
Iterative  methods  require  little  storage  beyond  that  to  store  A,  b,  and  y^,  implying  much  less  storage  than 
direct  methods  usually  require.  The  storage  and  computational  complexity  of  direct  methods  is  determined 
by  the  amount  of  fill-in  which  occurs  during  the  factorization,  i.e.,  the  loss  of  sparsity  that  occurs  in  the 
Gaussian  elimination  process  (see  [4]  for  a  complete  discussion  of  fill-in).  The  amount  of  fill-in  is  strongly 
dependent  upon  the  ordering  of  A,  and  the  usefulness  of  a  particular  ordering  depends  upon  the  particular 
application. 

For  signal  processing  applications,  the  same  filter  is  typically  applied  to  a  large  number  of  inputs. 
Thus  while  b  varies  in  (6),  A  remains  fixed.  For  such  applications,  A  must  be  factored  only  once.  This 
factorization  can  then  be  done  off-line,  in  which  case  the  factorization  costs  can  either  be  considered  part 
of  the  filter  design  process  or  amortized  over  the  large  number  of  inputs.  This  property  of  direct  methods 
motivates  us  to  focus  here  on  direct  implementations  of  2DNC-IIR  filter  systems.  Iterative  methods,  such 
as  preconditioned  conjugate  gradient  or  multi-grid,  might  be  just  as  or  more  effective  for  some  applications 
(especially  for  3-D  problems),  but  we  show  here  that  direct  methods  allow  for  very  efficient  implementations 
of  a  large  number  of  filters. 

The  primary  reason  for  using  the  LU  factorization  of  A  for  direct  methods  is  that  triangular  systems 
are  easy  to  solve  [4].  The  LU  factorization 

A  =  LU  (7) 

yields  a  unit  lower-triangular  matrix  L  and  an  upper-triangular  matrix  U.  The  LU  factorization  of  a  dense 
M  X  M  matrix  requires  2M^/3  computations,  measured  in  terms  of  floating  point  adds  and  multiplies. 
Storage  can  be  done  in  place  of  A,  such  that  no  extra  storage  is  needed  beyond  that  required  for  A.  Given 
the  LU  factorization  of  A,  the  solution  to  Ay  =  b  can  be  found  very  efficiently.  In  particular,  if  A  has 
dimension  M  x  M,  solving  the  following  two  triangular  systems  requires  only  2M^  computations: 

LQ  —  b,  “Forward  substitution”  (8) 
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Uy 


Back  substitution' 


(9) 


=  C- 

Note  that  if  we  solved  for  y  by  explicitly  computing  A~^  and  then  computing  A~^b,  2M®  +  2M^  computations 
would  be  required.  Thus  the  computational  savings  which  result  from  the  LU  decomposition  solution  to 
Ay  —  b  is  modest  for  dense  A.  However,  if  A  is  sparse,  as  for  2DNC-IIR  filter  systems,  the  savings  in  both 
storage  and  computations  can  be  tremendous  (orders  of  magnitude)  [4] .  In  particular,  if  A  is  sparse,  then  L 
and  U  will  be  sparse,  especially  if  proper  orderings  are  used.  Amortizing  the  costs  of  the  factorization  over  a 
large  number  of  filter  inputs  further  decreases  the  effective  computation  requirements  for  the  LU  approach. 
However,  note  that  in  our  application  M  =  N^,  the  number  of  2-D  data  points,  and  thus  computations  of 
order  greater  than  linear  in  M  can  still  make  this  approach  prohibitive.  Fortunately,  as  we  will  see,  in  the 
context  of  2-D  filtering  there  are  natural  and  very  accurate  approximations  to  the  LU  factorization  approach 
just  described  that  do  result  in  total  complexity  that  is  linear  in  M,  (or,  equivalently,  quadratic  in  N). 

Finally,  note  that  a  simple  generalization  of  the  LU  factorization  is  the  block  LU  factorization,  in 
which  L  and  U  are,  respectively,  lower  and  upper  block  triangular.  We  will  make  use  of  such  a  block 
factorization  in  the  next  subsection,  in  which  the  block  size  is  proportional  to  the  linear  dimension  of  the 
domain  of  interest.  In  this  case,  of  course,  the  complexity  of  the  calculations  implied  by  (8)  and  (9)  increase 
with  increasing  block  size,  and  it  is  this  fact  that  motivates  the  approximation  introduced  in  Section  4. 

3.2  Columnwise  Orderings 

The  particular  ordering  choice  that  we  make  here  is  motivated  by  the  structure  of  the  2-D  difference  equation 
(5),  which  will  be  exploited  in  Section  4.1  for  an  approximate  solution.  To  make  the  discussion  explicit  we 
focus  here  on  a  common  low-order  2-D  LCCDE,  the  9-point  nearest  neighbor  model  (NNM).  This  difference 
equation  also  arises  quite  frequently  in  engineering  applications,  most  notably  as  the  first-order  and  second- 
order  finite  difference  and  finite  element  approximations  to  elliptic  PDF’s  [6,9,11,14].  The  constant-coefficient 
form  of  the  9-point  NNM  is  given  by 

cy[i,j]  =  ny[i,j  -b  1]  +  sy[i,j  -  1]  -b  ey[i  +  l,j]  +  wy[i-  l,j]  (10) 
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y[i-i,j+i]  y[ij+i]  y[i+io+i] 


y[i-ij-i]  y[io-i]  y[i+ij-i] 


Figure  2:  The  output  mask  for  the  9-point  NNM  difference  equation. 


+  ney[i  +  l,j  +  1]  +  n^y[i  -l,j  +  1]  +  Sey[i  +  l,i  -  1]  +  -  l,i  -  1] 

+  x[i,j] , 


where  the  coefficients  are  labeled  according  to  the  directions  of  a  compass.  The  output  mask  of  this  difference 
equation  is  illustrated  in  Figure  2.  Note  that  the  LSI  system  characterized  by  the  frequency  response  from 
difference  equation  (10)  is  zero-phase  if  n  =  s,  e  =  lu,  Ug  =  Sw,  and  Uw  =  Sg-  Most  of  the  filters  implemented 
in  Section  5  are  zero-phase.  This  model  will  be  the  used  for  the  algorithmic  development  and  simulations  in 
this  paper,  but  the  results  readily  generalize  to  filters  of  higher-order  or  shift-varying  difference  equations. 

Consider  a  2DNC-IIR  filter  supported  on  a  square  grid  0^,  with  the  difference  equation  given  by  (10) 
and  boundary  conditions  in  Dirichlet  form,  i.e.,  y{i,  j]  =  r[i,j]  on  Sfl),/.  If  the  filter  input  and  output  variables 
defined  on  are  ordered  columnwise  into  N  xl  dimensional  vectors  yi  =  [y[i,l],y[i,2], . . .  ,y[i,N]]'^  and 
Xi  =  [x[i,l],x[i,2], . . .  ,x[i,N]]'^,  respectively,  the  2-D  filter  system  is  represented  algebraically  in  the  form 
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of  Equation  (6)  as^ 


Di  E2 

1/1 

Xi 

bi 

Cl  D2  E3 

1/2 

X2 

b2 

= 

H-r  = 

En 

1/JV-l 

xn~i 

bN-i 

Cn-1  Dn 

VN 

Xn 

✓  s. 

bN 

A 

y 

X 

b 

(11) 


As  in  Equation  (6),  the  vector  r  contains  the  Dirichlet  boundary  values  which  enter  through  the  difference 
equation.  The  structure  of  A  in  (11)  is  block  tridiagonal,  and  the  N  x  N  dimensional  blocks  Ci,  Di,  and  Ei 
are  tridiagonal.  Note  that  Equation  (11)  allows  for  a  space-varying  NNM  difference  equation,  but  for  the 
constant-coefficient  difference  equation  the  subscripts  on  the  blocks  of  A  can  be  dropped.  In  this  case,  the 
non-zero  elements  of  C,  D,  and  E  are  given  by 
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for  1  <  fc,  1  <  Af.  For  higher-order  difference  equations,  a  block  tridiagonal  ordering  similar  to  that  of  (11) 
can  be  created  by  allowing  yi  and  Xi  to  contain  filter  output  and  input  values,  respectively,  for  more  than 
one  column  oi  9, n-  In  this  case  the  blocks  Ci,  Di,  and  Ei  are  no  longer  tridiagonal,  but  will  have  small 
bandwidths  with  the  Reverse-Cuthill-McKee  ordering  (used  for  ordering  the  nodes  of  a  “thin”  graph  [4]). 
Another  possibility,  more  appropriate  for  the  algorithms  discussed  in  Section  4,  is  to  increase  the  number 
of  blocks  in  A  according  to  the  filter  order.  For  a  25-point  NNM  difference  equation,  which  has  an  output 
mask  with  one  more  square  layer  of  output  values  beyond  that  for  the  9-point  difference  equation  illustrated 
in  Figure  2,  A  would  be  block  penta-diagonal 

While  the  block  LU  factorization  of  a  block  tridiagonal  system  is  not  unique  [4,7],  one  particular 
'^For  any  matrix  in  this  paper,  such  cis  A  in  (11),  block  entries  not  indicated  are  zero. 
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choice  leads  to  a  simple  recursive  algorithm  for  the  computation  of  the  successive  blocks  of  the  factorization. 


Specifically,  the  quantities  Di,  . 

. .,  Dn  and  E2, 

. . En  needed  in  this  factorization 

are  recursively  computed 

from  the  following  equations^: 

DiEiJ^-i 

=  Ei+i 

(12) 

Di+\ 

=  —  C'l-E'i-flj 

(13) 

where  the  recursion  is  initialized  with  Di  =  Di.  Thus  the  recursion  begins  by  solving  (12)  for  E2,  then 
computing  ^2  from  (13),  etc.  Note  that  for  these  recursions  to  be  well-posed,  Di  fox  i  =  I, N  must 
be  invertible.  Conditions  which  guarantee  this  are  discussed  in  [7].  In  the  filtering  examples  presented  in 
Section  5,  the  matrices  Di  are  always  invertible.  Also,  solving  (12)  at  each  step  is  performed  by  an  LU 
factorization  on  Di,  and  indeed,  as  we  discuss  next,  this  factorization  Di  =  LuUa  is  needed  on-line. 

The  block  LU  factorization  resulting  from  the  procedure  just  described  is  given  by 

I  E2 

I  En 
I 

While  implementing  the  recursions  (12)  and  (13)  is  conceptually  straightforward,  the  number  of  computations 
and  storage  elements  needed  to  implement  the  off-line  factorization  (14)  can  be  overwhelming.  This  burden 
is  a  direct  result  of  the  columnwise  ordering,  which  leads  to  a  destruction  of  the  sparsity  of  A  during  the 
factorization.  Although  Ci,  D^,  and  Ei  are  very  sparse,  with  the  exception  of  Di  =  Di,  the  matrices  Di  and 
Ei  are  generally  full.  The  storage  of  these  matrices  required  for  the  on-line  solution  is  thus  0{N^).  Also, 
each  step  of  the  recursion  requires  the  factorization  of  the  full  matrix  Di  in  order  to  compute  .  Each 
such  factorization  requires  0{N^)  computations,  leading  to  0{N'^)  computations  overall. 

The  lack  of  sparsity  in  the  blocks  of  (14)  implies  a  large  computational  burden  for  the  on-line  solution. 

^The  validity  of  the  recursions  (12)  and  (13)  can  be  verified  directly  by  equating  A  in  (11)  with  the  expression  in  (14). 
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First  note  that,  since  the  factorization  =  LuUu  is  needed  at  each  step  of  the  recursion  (12)-(13)  to 
compute  -Ei+i,  these  factors  can  be  stored  in  place  of  Di  in  (14).  The  solution  to  (11)  is  then  given  by 
forward-substitution,  (initialized  by  CqCo  =  0), 

LuU^i(i  =  bi  —  Ci-iQ-i  i  =  1, . . . ,  N  (15) 

followed  by  back-substitution,  (initialized  hy  un  =  Cn), 

yi  =  -  E^+iyi+i  i  =  N  (16) 

Since  La  and  Uu  are  generally  full,  the  solution  of  (15)  and  (16)  requires  0{N‘^)  computations  per  step 
and  hence  0{N^)  total  computations.  Thus,  not  only  is  the  off-line  computational  load  large 
but  the  on-line  storage  and  computations  are  both  0{N^),  significantly  greater  than  the  0{N‘^)  goal  that 
corresponds  to  constant  per-pixel  computational  and  storage  burden. 

The  0{N‘^)  growth  in  computations  and  0{N^)  growth  in  storage  makes  the  columnwise  ordering 
procedure,  as  we  have  described  it,  infeasible  for  even  modestly  sized  2-D  filtering  problems.  For  exact 
solutions  to  (11),  alternative  orderings  or  iterative  implementations  must  be  employed  (see  Section  6). 
However,  if  an  approximate  solution  can  be  tolerated,  the  block  LU  factorization  based  on  the  columnwise 
ordering  leads  to  an  efficient  approximation  strategy  which  achieves  the  goal  of  O(iV^)  storage  elements 
and  0{N^)  computations  for  both  the  off-line  factorization  and  the  on-line  solution.  Before  describing  the 
approximation  strategy  in  detail  (see  Section  4),  we  can  motivate  its  development  by  closely  examining  a 
single  stage  of  the  on-line  solution.  Equation  (15)  requires  first  solving  the  lower  triangular  equations 

LiiZi  —  b-i  ,  (f^) 

followed  by  solving  the  upper  triangular  system 

UiiQ  =  2:i  .  (18) 
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Recall  that  we  have  organized  our  variables  into  1-D  columns,  and  thus  the  solution  of  the  lower  triangular 
system  (17)  can  be  thought  as  a  causal  1-D  recursion,  beginning  at  the  bottom  of  the  column  {j  =  1)  and 
proceeding  recursively  to  the  top  of  the  column  {j  =  N).  The  upper  triangular  system  (18)  thus  corresponds 
to  an  anticausal  recursion  proceeding  from  top  to  bottom.  The  back-substitution  filtering,  Equation  (16), 
requires  implementing  an  FIR  filter,  in  the  form  of  a  matrix  multiplication,  along  a  single  column  of  fliv- 

Thus  we  can  view  (17)  and  (18)  as  1-D  recursive  filtering  operations,  albeit  shift-varying  recursions, 
since  in  general  La  and  U^i  will  not  be  Toeplitz.  If  La  and  Ua  are  full,  then  the  order  of  these  recursive 
filters  equals  the  length  N  of  the  column,  and  it  is  the  need  to  determine  (off-line)  and  then  implement 
(on-line)  these  high-order  recursions  that  leads  to  the  severe  computational  burden.  However,  if  these 
recursive  1-D  filters  can  be  approximated  by  lower-order  recursions  —  e.g.,  if  Di  and  hence  La  and  Ua 
can  be  approximated  by  banded  matrices,  i.e.,  by  matrices  that  are  zero  except  for  a  band  of  preferably 
small  bandwidth  around  the  main  diagonal,  then  both  the  storage  and  computational  requirements  for  the 
forward-substitution  phase  of  the  on-line  solution  can  be  reduced  to  O^N"^).  For  the  back-substitution,  the 
computational  burden  is  governed  at  each  step  by  multiplication  of  the  matrix  with  r/i+i .  Since 
is  generally  full  and  not  Toeplitz,  this  operation  will  require  0{N'^)  computations  per  step.  However,  if 
we  similarly  approximate  Flj+i  with  a  lower-order  FIR  filter,  i.e.,  by  approximating  Ei+i  with  a  banded 
matrix,  the  total  computational  and  storage  requirements  for  the  on-line  solution  reduce  to  the  0{N^)  goal 
we  desire. 

Note,  however,  to  reduce  the  off-line  computational  load  to  0{N‘^),  it  is  not  sufficient  that  Di  and  Ei 
are  well  approximated  matrices  with  narrow  bandwidth;  a  method  must  exist  for  determining  these  approx¬ 
imate  matrices  in  0{N)  computations  per  stage.  In  the  next  section,  we  describe  such  an  approximation 
procedure  in  detail  and  also  discuss  conditions  under  which  we  would  expect  the  approximation  to  be  an 
excellent  one.  These  conditions  are  typically  satisfied  in  2-D  filtering  applications,  and  we  illustrate  several 
of  these  in  Section  5. 
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4  Efficient  Implementations  of  2DNC-IIR  Filters 

In  this  section,  we  develop  an  approximate  implementation  of  2DNC-IIR  filters  which  requires  0{N^)  compu¬ 
tations  and  storage.  Both  the  factorization  of  A  given  by  Equations  (12)-(13)  and  the  on-line  solution  given 
by  Equations  (15)-(16)  are  approximated  in  0{N'^)  computations.  A  parallel  approximate  implementation 
is  outlined  in  Section  4.3.  While  the  necessary  and  sufficient  conditions  under  which  the  approximation  is 
valid  are  not  prescribed,  some  analytical  guidelines  are  provided.  These  guidelines  are  used  to  develop  a 
number  of  useful,  low-order  filters  in  Section  5,  which  are  successfully  implemented  with  the  approximation 
algorithm  in  Section  5.  The  generality  of  the  guidelines  and  the  success  of  the  approximation  algorithm 
demonstrated  in  Section  5  leads  us  to  believe  that  the  approximation  is  valid  for  a  significant  class  of  filters. 

4.1  Development  of  the  Approximate  Block  LU  Algorithm 

The  approximate  implementation  of  2DNC-IIR  filters  described  in  this  subsection  is  motivated  by  a  simple 
observation  regarding  the  structure  of  the  LU  factorization  (14).  Namely,  for  many  filters,  a  small  number  of 
elements  in  the  blocks  Ci,  Di,  and  dominate  in  magnitude  the  rest  of  the  elements.  (This  dominance  is 
obvious  for  the  tridiagonal  blocks  Ci  in  Equation  (11).)  An  efficient  approximation  to  the  on-line  solutions 
follows  by  setting  to  zero  the  insignificant  elements  of  La,  Uu,  (remember  that  La  and  Uu  are  stored  in 
place  of  ^i),  and  Ei.  Recursions  (15)  and  (16)  then  can  be  implemented  very  efficiently  if  one  takes  care 
to  avoid  operating  on  the  zero  elements.  Also,  if  the  number  of  significant  elements  in  each  block  is  0{N), 
then  in  general  only  0{N‘^)  elements  of  (14)  will  need  to  be  stored,  implying  that  the  on-line  solution  can 
be  approximated  in  0{N‘^)  operations. 

Unfortunately,  the  approach  of  simply  discarding  the  insignificant  elements  of  (14)  has  significant 
drawbacks.  First,  the  off-line  factorization  will  still  require  0{N‘^)  computations.  Secondly,  searching  for 
the  significant  elements  of  each  block  in  (14)  and  storing  them  in  data  structures  necessary  for  efficient 
implementation  of  the  on-line  solution  can  be  a  costly  procedure.  A  large  class  of  filters,  however,  have  a 
property  which  allows  us  to  overcome  these  difficulties.  In  particular,  for  filters  such  as  those  described  in  this 
paper,  all  of  the  matrices  of  interest  —  its  LU  factors  Lu  and  Uu,  and  Ei  —  can  be  well- approximated 
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by  banded  matrices  of  some  bandwidth  P  N.  Thus,  we  know  a  priori  what  elements  of  these  matrices 
must  be  stored.  Moreover,  since  each  of  these  matrices  has  0{pN)  non-zero  elements,  there  are  0{N)  such 
matrices,  and  P  is  independent  of  N,  the  total  required  storage  is  0{N^),  as  desired.  Furthermore,  as  we 
now  describe,  we  can  compute  each  of  these  approximations  in  0{P‘^N)  or  0{pN)  computations,  resulting 
in  an  overall  computational  load  of  again  as  desired. 

The  key  assumption  required  for  these  approximations  to  yield  good  results  is  that,  for  i  =  1, ...  ,N, 
the  blocks  ^  are  approximately  banded,  i.e.,  well  approximated  by  setting  to  zero  all  the  elements  which 
do  not  fall  within  a  small  (relative  to  N)  bandwidth  of  the  main  diagonal.  For  example,  as  holds  for  many  of 
the  filters  considered  in  this  paper,  the  elements  of  D  ■  ^  decay  geometrically  in  magnitude  with  distance  from 
the  main  diagonal.  If  ^  is  approximately  banded,  then  from  the  recursions  (12)-(13),  it  is  apparent  that 
the  blocks  Di  and  Ei  will  generally  be  approximately  banded  as  well.  Furthermore,  as  the  following  corollary 
states,  if  we  have  a  banded  approximation  to  Dj,  we  can  efficiently  compute  a  banded  approximation  of  its 
inverse. 

Corollary  of  [5]  (See  Appendix  for  proof)  If  D  is  an  N  x  N  matrix  with  bandwidth  P,  the  elements 
of  D~^  which  lie  within  the  bandwidth  p  can  be  computed  (exactly)  in  0{p‘^N)  operations. 

This  leads  to  the  following  approximation  procedure,  replacing  Equations  (12)  and  (13).  Specifically, 
suppose  that  Di  is  an  approximation  to  Di  which  is  P-banded  (i.e.,  banded  with  bandwidth  p).  Note  that 
Di  =  =  Di  is  exactly  banded.  We  then  compute  a  /3-banded  approximation  to  D~^: 


FiDi,P) 


0, 


\k  -l\<p 
otherwise 


(19) 


where  F  :  requires  0{P‘^N)  computations  and  is  called  the  approximate  inverse  operator. 

Assuming  that  Dff^  is  a  good  approximation  to  Dj  ^  and  that  F{Di,p)  is  a  good  approximation  to  Df^, 
then 

E{Di,  P)  ■  Ei.i-1  se  Ei^i-i  .  (20) 


Since  Ei+i  is  tridiagonal,  the  matrix  multiplication  in  (20)  requires  0{pN)  computations.  Note,  however. 
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that  the  approximation  of  £,+i  in  (20)  has  bandwidth  (/3  +  2),  which  will  in  turn  require  a  growing  bandwidth 
at  each  step.  However,  under  the  key  assumption  that  the  matrices  of  interest  are  approximately  /3-banded, 
we  can  neglect  elements  outside  the  /3-bandwidth.  Thus,  define  the  operator  Tp  : 


Tpim 


1 


Hki, 

0, 


|fc-/|  </3 
otherwise 


We  thus  have  the  following  approximation  to  Equation  (12); 


(21) 


E,+,=Tp[F{D,,0)-Ei+i]. 


(22) 


Similarly,  substituting  Ei+i  into  (13)  in  lieu  of  Ei+i  yields 


Di+i  —  CiEiJ^i  w  Hi+i , 


(23) 


requiring  0{PN)  calculations.  Once  again,  the  quantity  in  (23)  is  (/3 -I- 2)-banded,  and  applying  our  assump¬ 
tion  of  /3-bandedness,  we  obtain  our  approximation  to  (13): 

D^+i=Tp[D,+i-QE,+i].  (24) 

Equations  (22)  and  (24)  can  then  be  repeated  iteratively,  each  stage  requiring  0{p^N)  computations.  For  i  = 
1, . . . ,  A^,  Equations  (22)  and  (24)  form  an  approximation  to  (14)  requiring  a  total  of  0{/3^N^)  computations 
and  0{j3N‘^)  storage  elements.  Furthermore,  the  LU  factorization  of  each  ,8-banded  Di  also  requires  only 
0{j3‘^N)  computations,  yielding  /3-banded  lower  and  upper-triangular  matrices  La  and  tJu.  This  process  is 
summarized  as  follows: 

(i)  choose  P  N 

(ii)  for  i  =  1, . . . ,  W  —  1 
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(a)  factor  Di  =  LuUu,  (Di  =  -Dj ); 
store  LU  factors  in  place  of  Di  ', 

(b)  compute  Gi  =  F{Di,p)- 

(c)  compute  Ei+i  =  Tp[G^Ei+i]-, 
store  Ei+x\ 

(d)  compute  Di+i  =  Tfj[D,.+  i  -  C, 
set  to  zero  values  outside  the  bandwidth  P\ 

(iii)  factor  Dn  =  LnnUnn  {2P'^N  computations),  r 


computations/step;  20^ N 
storage/step:  (2/3  +  \)N 
computations/step:  2P‘^N 
computations/step:  3{2P  +  l)N 
storage/step:  {2p  +  l)N 
computations/step:  3{2p  +  3)N 

store  factors  in  place  of  Dm  (2/3+1  storage  elements) 


This  procedure  results  in  the  following  approximation  of  (14) 


1 _ 

1 - 

Cl  D2 

/  En 

1 _ 

(25) 


with  La  and  Uu  stored  in  place  of  Di.  An  approximate  on-line  solution  follows  by  substituting  La  and  Uu 
into  (15)  for  La  and  Uu,  and  Ei  into  (16)  for  Ei. 

While  the  approximation  outlined  by  steps  (i)-(iii)  is  straightforward,  the  utility  and  applicability  of 
the  algorithm  has  not  been  demonstrated.  In  particular,  for  step  (i),  the  approximation  bandwidth  P  must 
be  chosen  according  to  the  desired  level  of  accuracy  of  the  solution  and  also  according  to  the  set  of  filter 
coefficients  (difference  equation  and  EC’s).  How  one  chooses  P  to  achieve  a  desired  level  of  approximation 
accuracy  depends,  of  course,  on  the  application.  If  P  is  too  large,  the  computational  savings  obviously  are 
lost.  What  follows  is  an  example  intending  both  to  suggest  the  class  of  filters  for  which  the  approximate 
implementation  will  offer  significant  savings  and  to  justify  why  in  such  cases  we  expect  to  be  able  to  choose 
a  small  value  of  p  independent  of  the  size  N  of  the  image  domain. 
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4.2  Analysis  of  the  Block  LU  Approximation 


Consider  a  2DNC-IIR  filter  on  flA?  described  implicitly  by  a  9-point  NNM  difference  equation  and  homoge¬ 
neous,  Dirichlet  EC’s  (i.e.,  Equation  (11)  with  r  =  0).  To  make  subsequent  analysis  more  simple,  we  make 
a  slight  deviation  from  the  constant-coefficient  model  of  Equation  (10).  Namely,  assume  that  c  =  (1  -f-  a^) 
and  n  =  s  =  —a  for  all  {i,j)  G  Ojv  and  f  /  1.  For  i  =  1  and  (f,j)  E.  ft n,  the  corresponding  coefficients 
become  c  =  1  and  n  =  s  =  -a.  For  the  point  we  wish  to  make,  the  values  of  the  other  NNM  coefficients  are 
of  no  consequence.  Instead,  we  are  interested  in  the  utility  of  the  block  LU  factorization  for  |a|  <  1.  The 
first  block  row  of  Equation  (11)  follows  as 


la  0  0 

a  1-t-a^  a  0 

0  a  1-f  a^  a 


0  a  1+a^  a 

0  0a  1-l-a^ 


yi  +  E2y2  -  xi  , 


The  factorization  of  4  in  (11)  begins  by  factoring  Di,  i.e.. 


1 

0 

0 

...  0 

1 

a 

0 

...  0 

a 

1 

0 

...  0 

0 

1 

a 

0 

a 

1 

0 

0 

1 

0 

0 

a 

0 

0 

a  1 

0 

0 

0  1 

I'll 


(26) 


(27) 


The  Toeplitz  structure  of  Ln  and  Un  is  a  result  of  the  particular  choice  of  difference  equation,  but  is  not  a 
general  property  even  of  filters  described  by  constant-coefficient  difference  equations. 

The  next  step  of  the  factorization  is  to  compute  £2-  Note  that,  even  though  Di,  Ln,  and  Un  are 
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banded,  this  operation  will  require  0{N'^)  computations  and  0{N‘^)  storage  elements  for  £2-  To  approximate 
the  first  stage  of  the  factorization,  we  only  need  to  compute  the  elements  of  Di  ^  within  a  pre-specified 
bandwidth.  For  this  simple  example,  some  simple  algebraic  manipulations  will  allow  us  to  determine  for 
which  values  of  a  such  an  approximation  is  valid.  Note  that  since  n  =  s,  Di  =  Di  is  symmetric.  Thus, 
Lii  =  Ufi  and  Also,  since  we  can  write  Uu  =  I  —  F,  where  F  is  strictly  upper-triangular, 

the  matrix  identity  (I  -  F)~^  =  I  +  F  +  ■  ■  ■  +  gives 


1  —a 
0  1  —a 


11 


0  0  1 


(-«) 


A'-l 


a 


2 


—a 


0  0  0  1 


The  simple  form  of  Equation  (28)  leads  to  a  single  expression  for  dki  = 


i.e.. 


(28) 


dki  =  {-ay  '=^(-a)2^ 

1=0 

From  this  expression,  the  “clustering”  of  the  significant  elements  du  about  the  main  diagonal  of  can  be 
determined  as  follows  (for  0  <  |a|  <  1): 


|4-i.i|  =  \a\\dki\ 


\dk,i+i  \  <  \a\\dki\ 


k<l, 


(29) 


Thus,  the  smaller  the  value  of  la|,  (equivalently,  the  greater  the  diagonal  dominance  of  Di),  the  tighter  the 
clustering  about  the  main  diagonal  of  and  thus  the  smaller  the  approximation  bandwidth  fi  necessary  to 
obtain  an  approximation  to  Dj  ^  at  a  desired  level  of  accuracy.  Also,  Equation  (29)  shows  that  the  elements 
of  15^^  essentially  decrease  in  magnitude  geometrically  with  distance  from  the  main  diagonal,  where  the  rate 
of  decay  depends  only  upon  |a|  and  not  N . 

How  might  these  observations  be  used  to  screen  or  suggest  filters  appropriate  for  the  approximate 
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block  LU  implementation?  In  analyzing  only  the  first  stage  of  the  factorization,  the  effect  of  the  NNM 
coefficients  other  than  c,  n,  and  s  has  not  been  accounted  for.  In  essence,  in  cases  where  the  matrices  Di 
have  inverses  for  which  all  the  significant  elements  are  tightly  clustered  about  the  main  diagonal,  the  same 
property  does  not  necessarily  hold  for  the  inverses  of  the  matrices  Di  (f  =  2, . . , ,  N),  which  depend  upon  all 
the  filter  coefficients.  However,  a  general  guideline  verified  by  extensive  numerical  simulations,  some  of  which 
are  given  in  Section  5,  is  that  the  approximate  factorization  becomes  more  accurate  (for  fixed  P)  as  the  ratio 
(|n|  4-  |5|)/lc|  decreases.  Furthermore,  the  approximation  appears  to  be  very  useful,  i.e.,  (3  is  small  for  any 
desired  level  of  accuracy,  for  \n\  +  |s|  <  |c|/2.  These  observations  are  consistent  with  the  preceding  example, 
since  (|n|  +  |s|)/|c|  0  as  |a|  ^  0.  Note  that  |c|/(|n|  +  |s|)  is  a  measure  of  the  “degree”  of  diagonal  dominance 

of  the  elements  in  the  blocks  Di.  For  the  non-constant  coefficient  filters,  analogous  observations  apply,  i.e., 
the  approximate  block  LU  factorization  becomes  more  accurate  as  the  degree  of  diagonal  dominance  in  the 
blocks  Di  increases.  The  issue  of  how  to  select  the  value  of  /?  is  discussed  in  Section  5,  where  we  show  that 
this  approach  can  be  successfully  applied  to  a  variety  of  filtering  tasks  ranging  from  low-pass  filtering  to  fan 
filtering  to  high-pass  operations  such  as  edge  enhancement. 

4.3  A  Parallel  Approximate  Implementation 

In  this  section,  we  briefly  discuss  a  straightforward  parallelization  of  the  algorithm  discussed  in  Section  4.1. 
Many  of  the  details  of  the  implementation  are  omitted,  since  they  depend  in  part  upon  the  available  computer 
architectures. 

A  variant  of  the  serial  block  LU  factorization  is  cyclic  block  reduction  [4] ,  which  is  easily  implemented 
in  parallel.  The  block  LU  factorization  proceeds  by  eliminating  columns  yi  sequentially  from  i  =  1  to  i  =  N. 
However,  it  is  possible  to  eliminate  columns  in  the  interior  of  H  independently.  For  example,  consider  again 
the  block  tridiagonal  matrix  A  given  in  (11),  which  corresponds  to  a  2DNC-IIR  filter  described  by  a  9-point 
NNM  difference  equation  and  Dirichlet  EC’s.  Assume  for  simplicity  of  notation  that  N  is  odd.  If  we  order  the 
even  columns  (i/i  for  f  =  2, 4, . . . ,  N  —  1)  last  and  the  odd  columns  {yi  for  f  =  1, 3, . . . ,  AI)  first,  Equation  (11) 
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takes  the  form 


^11  ^12  Vodd  bodd 

=  ■  (30) 

2^21  ^22  Ueven  ^ei;e7i 

Because  of  the  coupling  of  the  NNM  difference  equation,  the  elimination  of  the  odd  columns  of  yi  for  the 


block  factorization  of  (30)  can  be  performed  in  parallel.  Upon  completing  this  step,  the  second  block  equation 


of  (30)  becomes 


(31) 


where  the  superscript  0)  is  used  to  relabel  the  variables  after  the  1-stage  of  the  cyclic  block  reduction.  The 
blocks  of  A^22  given  by 


=  Di  —  Ci-iDi-i  ^ Ei  —  Ei+iDi+i  ^Ci, 

f  =  2,4,., 

cf) 

=  —Ci+iDi+i  ^Ci, 

i  =  2,4,.. 

CO 

1 

(32) 

=  —Ei-iDi^i  ^  Ei- 

f  =  4,6, . . 

..,7V-1 

Note  that,  if  the  difference  equation  is  constant-coefficient,  each  of  the  three  equations  in  (32)  only  needs  to 
be  computed  for  single  value  of  f. 

Since  is  again  block  tridiagonal,  the  odd-even  reordering  can  continue  recursively,  where  roughly 
one  half  of  the  remaining  columns  are  eliminated  in  parallel  at  each  stage  of  the  algorithm.  Thus,  rather 
than  the  N  stages  required  for  the  block  LU  algorithm  (and  corresponding  on-line  solution),  approximately 
log2  N'  stages  are  required  for  block  cyclic  reduction  (and  corresponding  on-line  solution),  and  each  of  the 
columns  in  each  stage  can  be  operated  on  in  parallel.  If  more  coarse-grained  parallelism  is  required,  Qn 
can  be  partitioned  into  M  regions,  where  the  columns  in  each  region  can  be  eliminated  independent  of  the 
other  regions.  (In  the  case  of  cyclic  block  reduction,  M  =  [{N  +  l)/2j .)  An  example  of  partitioning  for  four 
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Figure  3:  H  partitioned  into  four  regions  Ri  which  overlap  with  neighboring  regions  by  one  column  (variables 
in  the  regions  of  overlap  are  denoted  by  •).  For  a  9-point  filter,  the  internal  variables  of  each  region  (denoted 
by  o)  are  independent  of  the  internal  variables  of  other  regions,  allowing  parallel  Gaussian  elimination.  The 
arrows  indicate  the  direction  of  elimination  for  each  of  the  four  processors. 


processors,  M  =  4,  is  illustrated  in  Figure  3.  Note  that  the  four  sub-regions  Ri  have  overlapping  domains, 
so  that  all  but  the  boundaries  can  be  eliminated  independently.  As  shown  by  the  arrows  in  Figure  3,  the 
direction  of  elimination  for  each  region  Ri  is  from  the  center  outwards  to  the  boundary  columns.  In  essence, 
the  boundaries  are  ordered  last,  and  the  resulting  submatrix  for  the  boundaries  after  the  interiors  of  Ri  are 
eliminated  is  block  tridiagonal. 

An  approximate  block  cyclic  reduction  algorithm  follows  for  any  of  these  parallel  structures  by  noting 
the  strong  similarities  between  implementing  Equation  (32)  and  Equations  (12)-(13).  Namely,  if  a  processor 
is  allocated  for  each  of  the  odd  columns  of  the  first  stage  of  the  (parallelized)  approximate  cyclic  block 
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reduction  is: 


(a)  factor  Di  and  compute  F{Di,  /?)  on  processors  i  —  1,3, . . . ,  N; 

(b)  compute  =  F(Di, /?)  Cj-i  on  processors  i  =  3, 5, A^; 

compute  Ei+i  =  F{Di,(3)  Ei+i  on  processors  i  =  1,3, . . .  ,N  —  2-, 

(c)  compute  E^^^  —  -Ti3[Ei  -Bi+i]  on  processors  f  =  3, 5, . . . ,  A^-  2; 

compute  =  -Tp[Ci  Ci-i]  on  processors  i  =  3,5, . . . ,  N -  2-, 

(d)  compute  G,+i  =  C*  E^+i  on  processors  i  =  1, 3, . . . ,  N—  2; 

compute  Ji_i  =  Ei  Ci_i  on  processors  i  =  3,5, . . . ,  N-, 

(e)  compute  =  T0[Di+i-G,+i- J,p.i]  on  processors  f  =  1,3, . . .  ,N—  2; 

Note  that  step  (e)  requires  communication  between  the  processors,  since  and  are  computed  on 
different,  but  “neighboring”,  processors.  For  the  first  stage,  the  computational  load  for  each  processor  is 
0{f3‘^N),  where  the  asymptotic  complexity  is  determined  primarily  by  step  (a).  The  computational  load 
for  each  processor  will  remain  constant  for  subsequent  stages  of  the  algorithm.  Ignoring  inter-processor 
communication  costs  after  each  stage  of  the  recursion,  the  total  factorization  will  require  a  total  computation 
time  of  0{P^Nlog2  N).  Since  there  are  pixels,  the  per  pixel  computation  time  for  the  fully  parallel 
implementation  is  0{/3'^  \og2  N/N),  which  decreases  with  increasing  image  size.  The  on-line  calculations 
also  can  be  performed  in  parallel  with  similar  computational  savings. 

5  Examples  and  Simulations 

In  this  section,  a  number  of  2DNC-IIR  filters  are  presented.  The  filter  difference  equations,  along  with  easily 
specified  EC’s,  yield  frequency  responses  corresponding  to  canonical  and  widely  used  frequency-selective 
filter  classes,  e.g.,  low-pass,  high-pass,  and  fan  filters.  As  we  will  see,  the  frequency-selective  characteristics 
of  these  filters  can  be  achieved  quite  efficiently  by  implementing  the  approximation  algorithm  of  Section  4.1. 
Each  2DNC-IIR  filter  considered  here  has  a  difference  equation  in  the  form  of  Equation  (10).  The  system 


25 


function  for  such  difference  equations  is  given  by 


H{zi,Z2)  —  J,  J  ,  (33) 

2.2  2\ 


^2 

Ti-n} 

—n  —Tie 

Zi  ^ 

22  = 

1 

,  J  = 

~W 

c  —e 

»  Al  = 

1 

Z2~^ 

—  s  —Se 

Zi 

Note  that  the  system  function  (33)  has  no  numerator.  Adding  a  polynomial  z^P  z^  to  the  numerator  of 
(33)  would  obviously  allow  one  to  improve  the  filter  frequency  responses,  such  as  by  narrowing  the  transition 
regions;  however,  our  primary  interest  is  to  implement  the  recursive  portion  of  the  difference  equation,  in 
order  to  justify  both  the  viability  of  2DNC-IIR  filters  and  the  utility  of  the  approximation  given  in  Section  4.1. 
If  n  =  s,  e  =  w,  Ue  =  Sw,  and  =  Sg,  the  frequency  response  is  real  and  follows  as 


^  ’  c  —  2[ncosct;2  +  ecoswi  +  Tie  cos(wi  +  0^2)  +  w-u,  cos(c<;i  —  cja)] 


(34) 


These  filters  have  zero-phase.  Three  examples  of  zero  phase  filters  are  given  by 
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(35) 


where  Hi{zi,Z2)  =  {22  Ji  z^)^^ .  A  fourth  example,  whose  frequency  response  is  not  zero-phase,  is 


0.13 

-0.5 

0.37 

0.50 

2.0 

0.50 

0.13 

-0.5 

0.37 

(36) 


The  coefficients  of  each  filter  are  scaled  such  that  t^,)=(o  0)  ~  ^  desirable  filter  property 

which  prevents  the  filter  from  biasing  image  intensity  [13]. 
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IHjCco,  .coj)!  IH,(co,  ,e02)l 


Figure  4:  64  x  64  point  sampling  of  for  four  filters:  (a)  low-pass  filter  Hi  given  by  Ji;  (b) 

high-pass  filter  given  by  J2;  (c)  edge  enhancer  F3  given  by  J3;  (d)  fan  filter  H4  given  by  J4. 
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The  frequency  responses  of  all  four  difference  equations  are  illustrated  in  Figure  4.  The  frequency 
response  Hi  is  that  of  a  low-pass  filter,  while  H2  corresponds  to  a  high-pass  filter.  The  difference  equation 
for  system  H2  is  identical  to  the  frequency  response  of  the  edge  enhancer  in  [13].  The  frequency  selection  of 
H3  can  also  be  used  to  enhance  edges  aligned  with  the  j  and  i  axes,  while  H4  corresponds  to  the  frequency 
response  of  a  primitive  (low-order)  fan-filter.  Fan  filters  are  commonly  employed  in  seismic  signal  processing 
to  select  wavefronts  by  their  velocity  of  propagation  [13].  As  demonstrated  by  the  examples  given  in  [8,15],  the 
frequency  responses  of  each  of  the  example  filters  could  be  improved,  e.g.,  with  sharper  transition  regions, 
by  using  higher-order  ARMA  difference  equations.  (Higher-order  filters  can  be  implemented,  exactly  or 
approximately,  with  simple  extensions  of  the  algorithms  prescribed  in  Sections  2  through  4.) 

5.1  LSI  Filtering  with  2DNC-IIR  Filters;  the  Effects  of  EC’s 

As  mentioned  previously,  boundary  conditions  must  be  specified  for  each  of  the  four  difference  equations 
corresponding  to  (33)  with  J  =  Ji,  J2,  J3,  Ja-  In  addition,  since  these  filters  will  be  applied  to  images  defined 
over  finite  N  x  N  domains,  we  expect  that  the  responses  of  these  filters  will  exhibit  transient  behavior 
near  the  boundary  of  the  domain.  To  examine  and  illustrate  the  choice  of  boundary  conditions  and  the 
transient  behavior  near  the  boundaries,  we  first  implement  the  low-pass  filter  corresponding  to  Ji  assuming 
homogeneous  Dirichlet  conditions.  Since  low-pass  filters  are  often  used  as  shaping  filters,  the  filter  input  is 
chosen  to  be  White  Gaussian  Noise  (WGN)  with  zero  mean  and  unit  variance. 

The  low-pass  filter  system,  in  the  form  of  Equation  (11)  with  r  =  0,  is  solved  directly  and  without 
approximation  in  MATLAB  using  sparse  matrix  data  types  and  the  minimum  degree  ordering  algorithm.  A 
sample  output  y\i,j]  for  =  64  is  illustrated  in  Figure  5(a).  Ideally,  this  output  signal  will  be  identical  to 
the  signal  obtained  by  frequency-selective  filtering  according  to  the  frequency  response  given  in  Figure  4(a). 
However,  transient  effects  are  introduced  by  the  EC’s.  The  transient  signal  ye  is  defined  by 

2/[bi]  =yisi[i,j]  +  ye[ij],  {i,j)  e  fijv 

where  yui  is  the  output  of  an  FFT  filter  obtained  by  sampling  Hi{e^'^'- ,  For  the  output  signal  y[i,j]  in 
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Figure  5:  The  2DNC-IIR  low-pass  filter  (a)  response  to  WGN,  and  (b)  the  corresponding  transient  signal 
induced  by  the  boundary  conditions. 


Figure  5(a),  the  corresponding  transient  signal  is  illustrated  in  Figure  5(b).  Note  that  the  energy  in  ye[i,j]  is 
concentrated  at  the  boundary  of  fl;v,  and  has  relatively  insignificant  values  away  from  50 jv.  Nearly  identical 
results  are  obtained  for  other  realizations  of  the  WGN  input  signal  or  if  Neumann  conditions  are  imposed 
in  place  of  Dirichlet  conditions. 

The  magnitude  of  the  transient  signal  relative  to  that  of  the  filter  output  is  a  function  of  both 

the  difference  equation  coefficients  and  the  choice  of  boundary  conditions.  Fortunately,  for  the  examples  of 
the  following  section,  boundary  conditions  are  easily  prescribed  which  lead  to  relatively  insignificant  transient 
signals.  In  fact,  for  the  examples  in  which  the  energy  of  the  filter  response  y[i,j]  is  small  near  SQat,  the 
transient  effects  are  barely  distinguishable  from  finite-precision  arithmetic  error. 


5.2  The  Utility  of  the  Approximate  Block  LU  Factorization 

In  this  section,  each  of  the  four  example  filters  described  at  the  beginning  of  Section  5  is  implemented 
with  the  approximate  block  LU  algorithm  of  Section  4.1.  One  can  check  that,  for  each  of  the  example 
difference  equations,  the  coefficients  satisfy  |n|  -I-  |s|  <  |c|/2,  a  condition  argued  in  Section  4.2  to  allow 
efficient  implementation  of  such  filters  by  the  approximation  algorithm. 
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For  analyzing  the  approximation  errors,  two  error  measures  are  given  by 


e  = 


(37) 


where  e  is  the  relative  energy  of  the  approximation  error.  The  vector  e  contains  the  approximation  errors  at 
each  point  in  fijv  and  ||  •  ||p,  p  —  1,2,  is  the  standard  Ip  norm. 

For  simplicity,  homogeneous  Dirichlet  boundary  conditions  are  assumed  in  all  of  the  examples  which 
follow.  For  the  first  three  examples,  assume  N  =  64.  In  the  fourth  example,  the  effect  of  variations  in  N 
upon  the  approximation  accuracy  is  considered. 

Example  1 :  The  Frequency  Selection  of  Low-pass  and  Fan  Filters 

In  this  example,  we  consider  the  response  of  the  low-pass  and  fan  filters  to  single  frequency  sinusoids  of  the 
form 


1  /27rfci 


{i,j)  e  Q.N 


(38) 


The  approximation  errors  are  then  analyzed  for  various  values  of  /?.  Analysis  for  this  example  is  thus  most 
easily  done  in  the  frequency  domain. 

For  each  filter,  two  different  input  signals,  corresponding  to  two  different  choices  for  the  pair  {ki,k2) 
in  (38),  are  chosen  such  that  one  input  lies  in  the  filter  pass-band  and  the  other  in  the  stop-band.  For  the 
low-pass  filter.  Equation  (38)  with  (fci,  fc2)  =  (3, 2)  lies  in  the  pass-band,  while  (fci,  ^2)  =  (25, 20)  is  clearly  in 
the  stop-band  (see  Figure  4(a)).  The  exact  response  of  the  low-pass  filter  to  the  pass-band  input  is  illustrated 
at  the  top  of  Figure  6,  while  the  exact  response  to  the  stop-band  input  is  illustrated  at  the  top  of  Figure  7. 
The  relative  magnitudes  of  the  two  outputs  illustrate  the  9-to-l  selection  ratio  of  the  low-pass  filter.  Also, 
the  transient  signals  —  manifested  by  the  ridges  in  the  DFT’s  —  are  relatively  small;  in  fact,  these  ridges 
are  barely  noticeable  for  the  response  to  the  stop-band  sinusoid. 

The  effectiveness  of  the  algorithm  of  Section  4.1  is  demonstrated  in  Figures  6  and  7  for  /I  =  2  and 
4.  For  each  of  the  two  low-pass  filter  inputs,  the  approximate  filter  solution  ya[hj]  and  the  approximation 
error  e[i,j]  =  y[i,j]  —  ya[i,j]  are  illustrated.  For  both  inputs,  the  approximation  errors  are  small  even  when 
/3  =  2,  and  the  errors  decrease  by  an  order  of  magnitude  when  P  increases  from  2  to  4.  (The  approximation 
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error  will  be  shown  to  decrease  geometrically  as  a  function  /3  in  Example  3.) 

Similar  results  are  obtained  for  the  fan  filter.  For  a  sinusoidal  input  in  the  pass-band  with  (fci, /C2)  = 
(0, 20),  the  exact  filter  response,  the  approximate  responses  for  /?  =  2  and  4,  and  the  corresponding  approxi¬ 
mation  errors  are  illustrated  in  Figure  8.  Figure  9  contains  the  analogous  plots  for  the  stop-band  input  given 
by  {ki,k2)  =  (25,0).  For  both  figures,  the  exact  responses  exhibit  the  frequency-selectivity  given  by  H4  in 
Figure  4(d),  and  the  transient  signals  are  relatively  small.  Also,  as  for  the  low-pass  filter,  the  approximation 
errors  are  small  even  when  fi  =  2  and  decrease  an  order  of  magnitude  when  /?  increases  to  4. 

Example  2:  Edge  Enhancement  of  a  Square  Pulse 

For  this  example,  the  response  of  the  two  edge-enhancing  filters  (J2  and  J3  in  Equation  (35))  to  a  square 
pulse  is  approximated  with  /?  =  2  and  /3  =  4.  The  square  pulse  input  and  the  exact  response  of  filter  J2  are 
illustrated  at  the  top  of  Figure  10.  The  exact  response  of  filter  J3  is  given  in  Figure  11.  Ignoring  machine 
error,  both  of  the  exact  responses  are  identical  to  those  dictated  by  the  frequency  response  of  the  filter 
difference  equations,  i.e.,  there  is  no  transient  signal  induced  by  the  boundary  conditions.  The  absence  of 
the  transient  is  due  to  the  fact  that  the  filter  responses  determined  directly  from  the  frequency  responses 
are  zero  on  the  boundary  of 

The  approximation  errors  are  unrecognizable  in  Figures  10  and  11,  unless  one  closely  examines  the 
approximate  solution  for  system  J3  when  (3  =  2.  In  any  case,  for  both  filters  the  errors  again  decrease  an 
order  of  magnitude  as  (3  increases  from  2  to  4. 

Example  3:  Accuracy  of  the  Approximation  vs.  /3 

To  demonstrate  the  relationship  between  the  approximation  bandwidth  (3  and  the  accuracy  of  the  approxi¬ 
mation,  the  relative  energy  of  the  approximation  error,  defined  by  e  in  (37),  is  plotted  versus  (3  in  Figure  12 
for  each  of  the  four  example  filters.  (For  Figure  12,  the  input  is  WGN  and  the  error  e  is  the  sample  average 
over  10  sample  inputs.)  For  each  filter,  the  errors  decrease  with  /I  at  a  constant  geometric  rate,  which  is 
consistent  with  the  analysis  of  Section  4.2.  In  Section  4.2,  we  argued  that  the  approximation  errors  will 
be  small,  even  for  small  /?,  when  \n\  -h  |s|  <  |c|/2.  In  these  cases,  the  elements  in  the  blocks  Di  and  Ei 
will  typically  decrease  geometrically  in  magnitude  with  distance  from  the  diagonal.  Thus,  for  /3-banded 
approximations  of  these  blocks,  one  would  expect  the  approximation  errors  to  also  decrease  geometrically 
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with  increasing  approximation  bandwidth. 

Example  4:  The  Independence  of  the  Approximation  Accuracy  upon  N 

In  Section  4.1,  the  computational  and  storage  loads  of  the  approximation  algorithm  were  shown  to  be 
O{0^N'^)  and  0{f3N‘^),  respectively.  However,  if  the  computational  and  storage  loads  are  to  be  truly  constant 
per  pixel,  the  approximation  bandwidth  needed  for  a  desired  approximation  accuracy  must  not  be  an 
increasing  function  of  N.  Figure  13  shows  that  the  approximation  error  remains  constant  over  a  wide  range 
of  N  for  a  fixed  value  of  /?  =  4.  (As  for  Figure  12,  the  errors  e  in  Figure  13  are  computed  by  averaging  over 
ten  WGN  inputs.)  Identical  results  are  obtained  for  other  values  of  (3.  The  independence  of  (3  upon  N  for  a 
desired  solution  accuracy  is  consistent  with  the  analysis  of  Section  4.2,  and  we  expect  when  |n|  +  |s|  <  |c|/2 
that  j3  will  be  independent  of  N  for  a  desired  level  of  approximation  accuracy.  Thus,  the  approximation 
algorithm  has  constant  per  pixel  computational  and  storage  load. 

6  Conclusion 

In  this  paper  we  describe  an  approach  to  the  efficient  implementations  of  2-D  noncausal  HR  filters.  In  the 
past,  2-D  HR  filters  received  comparatively  limited  attention  due  to  the  apparent  difficulty  of  implementing 
these  filters  efficiently.  In  addition  to  efficiency,  we  were  also  motivated  to  consider  filters  specified  by 
boundary  conditions  rather  than  initial  conditions,  as  the  former  are  frequently  the  natural  choice  and  are 
required,  for  example,  if  zero-phase  filtering  is  desired.  Indeed,  a  number  of  methodologies  now  exist  for 
designing  2-D  difference  equations  to  meet  desired  frequency-selective  specifications  [8],  and  we  demonstrated 
that  imposing  boundary  conditions,  such  as  Dirichlet  or  Neumann,  upon  these  difference  equations  leads  to 
the  desired  frequency  selectivity. 

The  approach  we  developed  for  efficiently  implementing  2-D  noncausal  HR  filters  involved  a  combi¬ 
nation  of  two  things:  (a)  the  application  of  concepts  from  the  direct  solution  of  PDE’s  to  the  calculation 
of  the  solution  of  a  2-D  difference  equation;  and  (b)  the  development  of  new  approximations,  motivated 
by  and  appropriate  for  filtering  applications,  that  reduce  the  algorithms’  complexity  to  desired  levels.  In 
particular,  the  algorithms  resulting  from  our  procedure  have  constant  computational  complexity  per  pixel 


32 


and,  if  implemented  in  maximally  parallel  form,  have  total  computation  time  per  pixel  that  decreases  as 
image  size  increases.  In  particular,  our  approximation  is  based  on  the  columnwise  ordering  of  data  points 
and  the  block  LU  factorization  of  the  linear  system  that  results  from  this  ordering.  While  exact  factorization 
is  still  complex  computationally,  the  observation  that  each  successive  block  of  computation  could  be  viewed 
as  a  1-D  filtering  operation  along  a  column  of  the  image  led  to  the  idea  of  a  reduced-order  approximation 
of  each  of  these  1-D  columnwise  filters.  In  matrix  terms,  this  corresponds  to  a  banded  approximation  to 
each  of  the  blocks  in  the  block  LU  factorization,  with  bandwidth  (and  1-D  filter  order)  (3.  The  resulting 
algorithm  was  shown  both  to  achieve  the  computational  levels  mentioned  previously  and  to  yield  excellent 
results  using  small  values  of  (3  for  a  number  of  low-order  frequency-selective  filters. 

The  approach  that  we  have  just  described  is,  in  principle,  applicable  to  a  broad  range  of  filtering 
problems,  e.g.,  those  that  are  higher  order  or  have  non-constant-coefficient  difference  equations.  Indeed,  the 
success  we  have  demonstrated  here  together  with  the  guidelines  we  have  described  for  situations  in  which  our 
approximation  should  work  well  provide  ample  motivation  for  the  application  of  this  methodology.  These 
properties  also  suggest  a  theoretical  investigation  of  general  conditions  on  the  difference  equation  coefficients 
under  which  our  approach  is  guaranteed  to  provide  accurate  answers  for  small  values  of  (3.  Finally,  we  also 
believe  that  there  is  much  more  that  can  be  done  in  adapting  other  numerical  methods,  both  direct  and 
iterative,  for  solving  PDF’s  in  order  to  develop  efficient  procedures  for  2-D  HR  filtering. 

A  Computing  Certain  Elements  of  the  Inverse  of  a  Banded  Ma¬ 
trix 

A  special  application  of  the  results  in  [5]  is  the  ability  to  compute  very  efficiently  certain  elements  of 
the  inverse  of  a  banded  matrix.  Namely,  if  Ap  \s  N  x  N  dimensional  matrix  with  bandwidth  /3  and 
Z  =  then  the  elements  of  Z  which  lie  within  a  bandwidth  /3  of  the  diagonal  can  be  computed  in 

0{f3'^N)  computations.  The  results  of  [5]  are  based  on  the  following  observation;  given  Ap  =  LDU  (where 
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L  and  U  are  unit  lower-triangular  and  upper-triangular,  respectively),  then 


Z  =  D-'^L-^ +  {I -U)Z 

(39) 

Z  =  D-^U-^  +  Z{I  -  L) 

(40) 

From  these  relations,  [Z]ik  for  \l 

—  k\  <  /?  is  given  by  Equation  (41), 

which  does 

not  depend  upon 

computing  any  elements  in  L~ 

^  and  . 

l  =  k 

\Z]ik  =  < 

-Y!lti+Au\im[zUk, 

m  <  k 

(41) 

m  >  k 

For  certain  matrices,  such  as  those  discussed  in  Section  4.1,  the  elements  of  Z  which  fall  within  the  bandwidth 
/?  can  be  seen  as  a  reasonable  approximation  to  Ap. 

To  implement  this  algorithm,  note  that  [Z]nn  niust  be  the  first  element  computed  of  Z.  Also,  to 
compute  any  element  \Z]ik,  all  elements  [^]mn  such  that  m  >  I,  n  >  k,  and  \l  -  k\  <  l3  must  already  have 
been  computed.  With  these  restrictions  placed  on  the  recursion,  the  number  of  computations  to  compute  Z 
within  a  bandwidth  (3  is  bounded  closely  by 

#  computations  <  N[{I3  +  1)  +  2/3^] ,  (42) 

where  the  first  term  represents  the  computations  for  the  diagonal  elements  only.  The  computational  com¬ 
plexity  of  the  algorithm  is  thus  O{0^N). 
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/3  =  2 


Error  for  /3  =  2 


Figure  6:  The  DFT  magnitude  of  the  exact  and  approximate  responses  of  the  2DNC-IIR  low-pass  filter  to 
Equation  (38)  for  (fcijfe)  =  (3,2).  Note  from  the  axes  of  the  two  error  plots  that  the  approximation  error 
for  /3  =  4  is  an  order  of  magnitude  less  than  that  for  P  =  2.  For  /3  =  2,  e  =  9.1  x  10“^  and  77  =  8.6  x  10“^. 
For  ^  =  4,  e  =  9.7  x  10"^  and  77  =  8.7  x  IQ-T 


13  =  2  Error  for  /?  =  2 


Figure  7;  The  DFT  magnitude  of  the  exact  and  approximate  responses  of  the  2DNC-IIR  low-pass  filter  to 
Equation  (38)  for  (^1,^2)  =  (25, 20).  Note  from  the  axes  of  the  two  error  plots  that  the  approximation  error 
for  /?  =  4  is  an  order  of  magnitude  less  than  that  for  (3  =  2.  For  /?  =  2,  e  =  8.0  x  10”^  and  77  =  7.8  x  10“®. 
For  /3  =  4,  e  =  1.1  x  10“®  and  r?  =  1.1  x  10^^. 
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Error  for  /?  =  4 


Figure  8:  The  DFT  magnitude  of  the  exact  and  approximate  responses  of  the  2DNC-IIR  fan  filter  to 
Equation  (38)  for  (^1,^2)  =  (0,20).  Note  from  the  axes  of  the  two  error  plots  that  the  approximation  error 
for  /?  =  4  is  an  order  of  magnitude  less  than  that  for  /?  =  2.  For  0  =  2,  e  =  2.0  x  10“^  and  77  =  2.0  x  10“^. 
For  /?  =  4,  e  =  1.2  X  and  ??  =  1.2  x  10“^. 
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/3  =  2 


Error  for  /3  =  2 


Figure  10:  Exact  and  approximate  responses  of  the  edge-enhancing  filter  given  by  J2  of  Equation  (35) 
to  a  square  pulse.  Note  from  the  axes  of  the  two  error  plots  that  the  approximation  error  for  /3  =  4  is  an 
order  of  magnitude  less  than  that  for  (3  =  2.  For  (3  =  2,  e  =  4.0  x  10~^  and  rj  =  4.3  x  10“^.  For  p  =  4:, 
e  =  3.0  X  lO-"*  and  n  =  3.3  x  10"^. 


P=^2 


Error  for  /?  =  2 


Figure  11:  Exact  and  approximate  responses  of  the  edge- enhancing  filter  given  by  J3  of  Equation  (35)  to  a 
square  pulse.  Note  from  the  axes  of  the  two  error  plots  that  the  approximation  error  for  /?  =  4  is  an  order  of 
magnitude  less  than  that  for  /3  =  2.  For  p  =  2,  e  —  3.6  x  10“^  and  77  =  3.2  x  10“^.  For  ^  =  4,  e  =  6.3  x  10"^ 
and  7]  =  5.7  x  10~^. 


Figure  12;  Approximation  errors  vs.  j3  when  N  =  64.  The  four  example  filters  Hi  are  given  by  the  difference 
equation  coefficients  in  Equations  (35)  and  (36). 


Figure  13;  Approximation  errors  for  /?  =  8.  The  four  example  filters  Hi  are  given  by  the  difference  equation 
coefficients  in  Equations  (35)  and  (36). 


